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ABSTRACT 

Over the past ten years Bayesian methods have rapidly grown more popular in many 
scientific disciplines as several computationally intensive statistical algorithms have be- 
come feasible with increased computer power. In this paper, we begin with a general 
description of the Bayesian paradigm for statistical inference and the various state- 
of-the-art model fitting techniques that we employ (e.g., the Gibbs sampler and the 
Metropolis-Hastings algorithm). These algorithms are very flexible and can be used to 
fit models that account for the highly hierarchical structure inherent in the collection 
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of high-quality spectra and thus can keep pace with the accelerating progress of new 
space telescope designs. The methods we develop, which will soon be available in the 
Chandra Interactive Analysis of Observations (CIAO) software, explicitly model photon 
arrivals as a Poisson process and, thus, have no difficulty with high resolution low count 
X-ray and 7-ray data. We expect these methods to be useful not only for the recently 
launched Chandra X-ray observatory and XMM but also new generation telescopes such 
as Constellation X, GLAST, etc. In the context of two examples (Quasar S5 0014-1-813 
and Hybrid-Chromosphere Supergiant Star a TrA) we illustrate a new highly struc- 
tured model and how Bayesian posterior sampling can be used to compute estimates, 
error bars, and credible intervals for the various model parameters. Application of our 
method to the high-energy tail of the ASCA spectrum of a TrA confirms that even at a 
quiescent state, the coronal plasma on this hybrid-chromosphere star is indeed at high 
temperatures (> 10 MK) that normally characterize flaring plasma on the Sun. We are 
also able to constrain the coronal metallicity, and find that though it is subject to large 
uncertainties, it is consistent with the photospheric measurements. 

Subject headings: methods:data analysis — methods:statistical — stars:individual(alpha 
TrA) - stars:activity — (galaxies:)quasars:individual(Quasar S5 0014-1-813) — X-rays:general 

1. Introduction 

The ever increasing power and sophistication of today's high energy instruments is opening a 
new realm of high quality data that is quickly pushing beyond the capabilities of the "classical" 
data-analysis methods in common use. In this paper, we present an innovative implementation of 
state-of-the-art statistical methods for fitting high resolution spectra from the Chandra X-ray Ob- 
servatory. The common "folk-wisdom" of how to bin data, subtract background counts, propagate 
errors, and, for example, estimate the significance of a spectral line profile are unreliable and can 
lead to unacceptable results (see Loredo 1993; Nousek 1993; Feigelson & Babu 1997; Siemiginowska 
et al. 1997; Zimmerman 1997, for discussion). For example, binning data sacrifices the resolution 
of the instrument, subtracting background can lead to negative counts with unpredictable results, 
and statistical black boxes such as the and Cash statistics (Lampton et al. 1976; Cash 1979) al- 
though often useful, may not be equipped to answer standard questions (e.g., van Dyk & Protassov 
2000). Some authors have suggested solutions to such problems which involve ad hoc adaptations 
of commonly used methods (e.g., Gehrels 1986; CoUura et al 1987; Mighell 1999). Unfortunately, 
when such solutions are not rooted in a theoretical framework, they have no justification beyond 
problems which are more-or-less the same as the simulation studies which justify them and we 
are often forced into additional ad hoc adaptations. This approach is difficult to justify in light 
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of modern statistical methods which address reasonable model assumptions directly. Thus, in re- 
cent years, astrophysicists have increasingly turned to likelihood based (e.g., Lucy 1974; Cash 1979; 
Schmitt 1985; Sciortino &: Miccla 1992) and Bayesian methods (e.g., Bijaoui 1971; Richardson 1972; 
Gregory & Lorcdo 1992; Lorcdo 1993; Connors 1997; Siemiginowska 1997; Kashyap & Drake 1998; 
Freeman et al. 1999). The primary purpose of this paper is to illustrate how Bayesian methods 
can provide practical answers to outstanding real problems which standard methods are not able 
to handle. The methods described here are equipped with readily available parameter estimates, 
credible intervals, error bars, model checking techniques, methods for combining information from 
multiple sources, etc. — all within a flexible theoretical framework and without reliance on asymp- 
totic Gaussian approximations. 

We illustrate Bayesian data analysis via two detailed examples. The analysis of Quasar S5 
0014-1-813 offers a straight forward introduction to our methods and the extremely low count 
Hyprid-Chromosphere Supergiant Star a TrA observation shows how we tackle a previously in- 
tractable analysis. Together, these examples demonstrate the power of Bayesian methods to han- 
dle highly structured models designed to reflect the structure in both the source spectrum and 
the data collection process. Our methods avoid the binning of counts and thus the sacrificing of 
high-resolution information required by standard data analysis methods. The analysis of Quasar 
S5 0014+813 is consistent with the available standard analysis which relies on extra binning and 
the removal of the high-energy low-count tail. 

We emphasize that we model not only the source spectrum, but also other stochastic com- 
ponents of data collection and the instrument such as background contamination and instrument 
response. Generally we refer to our stochastic representation of the entire process as the (statistical) 
model. For clarity we refer to the spectral or physical model as the source m,odeA and to the model 
for the observed (PHA^) counts as the observed- data model. In our detailed example, we develop a 
model and algorithms for spectral analysis of high-energy (or other) data using a Poisson^ process 
for photon arrivals. We allow for: stochastic instrument response via a photon redistribution ma- 
trix; the absorption of photons; the effective area'^ of the telescope; and background contamination 
of the source. In particular, we model information on background emissions as the realization of a 



^Pulse Height Amplitude, originally in proportional counters, the number of electrons produced by a photon, 
hence the amplitude of the current pulse registered by the detector electronics. The term now refers to the measure 

of the energy deposited on the detector (as opposed to the true energy). 

^Recall, a random variable X is said to follow a Poisson distribution with parameter or intensity A if Pr(X = 
x) = e~^y lx\. In this case E(X) = A and we often write X ~ Poisson(A) (read as X is distributed as Poisson with 
intensity A). This representation conditions on the intensity parameter, A, which in turn may vary. 

* The effective area of the telescope is the fraction of the true geometric area that the telescope presents to sky. 
This varies with energy. 
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second Poisson process (cf. Loredo 1993), thereby eliminating the need to directly subtract off the 
background counts and the rather embarrassing resulting problem of negative photon counts. The 
source energy spectrum is modeled as a mixture of several (Gaussian) line profiles and a Gener- 
alized Linear Model (GLM)^ (e-g-, McCullagh & Nelder 1989) which accounts for the continuum. 
GLMs have become the standard statistical method for incorporating information contained in 
independent variables (as in regression) into many non-Gaussian models and are thus an obvious 
but innovative choice in this setting. 

In addition to several Markov chain Monte Carlo (MCMC) algorithms, we describe and use 
data augmentation, an important statistical method for Baycsian (and other) analysis. Data aug- 
mentation is an elegant compTitational construct allowing tis to take advantage of the fact that if 
it were possible to collect additional data, statistical analysis would be greatly simplified. This is 
true regardless of why the so-called "missing data" are not observed. For example, if we were able 
to record the counts due to background contamination in addition to total counts in each bin, it 
would, of course, be a trivial task to account for the background. There is a large class of powerful 
statistical methods designed for "missing data" problems. With the insight that "true" values of 
quantities recorded with measurement error can be regarded as "missing data," these methods can 
usefully be applied to almost any astrophysical problem. In particular, we can treat the true image 
(before instrument response), the absorbed photon counts, and unbinned energies as "missing data" 
to account for instrument response, absorption, and binning, respectively. 

This introduction fore shadows the tone of the paper: in the process of developing new Bayesian 
methods, we describe and utilize state-of-the-art statistical reasoning, methods, and algorithms, 
whereby we explore a larger statistical framework for our problem of interest. Although we in- 
troduce many new tools and use terminology that may be unfamiliar, we endeavor to write in a 
manner accessible to astrophysicists and believe the resulting methods justify the required inter- 
disciplinary work. Table 1 indexes terminology used in the paper that may be unfamiliar to some 
readers. To aid in the translation from standard statistical notation to standard astrophysical us- 
age, many equations have been written according to both standards when notation is introduced. 
One notational convention is worthy on mention; we use superscripts to identify model components 
(e.g., background or absorption). 

The paper is organized as follows. After a brief overview of the fundamentals of Bayesian 



®In a GLM we assume a transformation (e.g., log) of the model is linear in a set of independent variables. We 
emphasize that this is not equivalent to transforming the data and proceeding with linear regression. A generalized 
linear model utilizes the likelihood of the assumed model which may not be Gaussian. (E.g., the assumed model may 
be Poisson.) See Section 3.3 for details. 
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analysis in Section 2, we lay out our hierarchical statistical model^ which summarizes the photon 
collection process and parameterizes many relevant aspects of the energy spectrum in Section 3. 
Two examples which aim to illustrate a typical data analysis and the advantages of Baycsian 
methods in this setting are given in Section 4. Section 5 contains brief concluding remarks. Finally, 
in two appendices, we outline such important general MCMC methods as data augmentation, the 
Gibbs sampler, the Metropolis-Hastings algorithm, and judging convergence using multiple chains. 
We also describe in detail how we use these algorithms to fit the hierarchical source model of 
Section 3. 



2. Bayesian Analysis 

In this section we outline several important methodological and computational issues involved 
with Bayesian analysis using a simple model that accounts for background in a simplified Poisson 
process to motivate and illustrate ideas. Our introduction is brief and we encourage interested 
readers to consult one of the several high-quality recent texts on the subject such as Gelman, 
Carlin, Stern, & Rubin (1995), Carlin & Louis (1996), Gilks, Richardson, & Spiegelhalter (1996), 
and Sivia (1996). In Section 3, we show how the ideas developed here can be used for a detailed 
spectral analysis. 



2.1. Prior, Sampling, and Posterior Distributions 

Bayesian probability analysis is fundamentally based on one simple result known as Bayes's 
Theorem which allows us to update a probability distribution based on new data or other informa- 
tion. In particular, knowledge about a (vector) model parameter, 6, is summarized by a probability 
distribution, p{6), such that Pi{0 e R) = Jj^p{0)d6 for any region^ R. Bayes's Theorem states 

where Y are the observed data or other new information pertaining to and I represents any 
initial information known before Y is observed. Here, p{0\T) represents our knowledge prior to 



hierarchical (statistical) model is formulated in terms of unobserved quantities, which are themselves statisti- 
cally modeled. For example, we may assume photons first arrive at a detector according to a Poisson process and 
them axe randomly redistributed according to a photon redistribution matrix. A hierarchal model separates these 
two random process into two levels of a structured model. 

'^The notation Pt{0 € R) represents the probability that is in the region R and is computed as the integral of 
the probability distribution of over R. 
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observing Y and is called the prior distribution. The sampling distribution or likelihood, p{Y\0,T) 
represents the likelihood of the data given the model parameters, and p{6\Y,T) represents our 
updated knowledge regarding 6 after observing Y and is called the posterior distribution. Finally, 
p(Y|I) represents the unconditional distribution of Y and acts as the normalizing constant for 
p{0\Y,I). The functional form of Baycs's Theorem describes how our prior knowledge should be 
updated in light of information contained in the data. The likelihood or sampling distribution is 
the basis for many standard statistical techniques, while the prior and posterior distributions are 
specific to Bayesian analysis. 

To illustrate Bayes's Theorem, suppose we have observed counts, Y, contaminated with back- 
ground in a (source) exposure and have observed a second exposure of pure background. Through- 
out this section, we assume the source exposure is minutes and the pure background exposure 
is minutes with both exposures using the same area of the detector. (We generally use super- 
scripts to represent photon "sources," e.g., background or source. Occasionally we use superscripts 
for powers; for clarity we place powers outside parentheses.) To model the source exposure, we 
assume Y follows a Poisson distribution with intensity -|- A*^, where A^ and A*^ represent the 
expected counts during the source exposure due to background and source respectively. Thus, the 
likelihood is 

p(y|A-^,A^,I) = e-(^''+^'')(A^ + A^)^/y! for y = 0, 1, 2, . . . (2) 

We wish to estimate A*^ and treat A^ as a nuisance parameter, a parameter that is of little interest, 
but nuist be included in the model. As is detailed below, an important advantage of Bayesian 
methods is their ability to handle nuisance parameters by computing the marginal posterior distri- 
bution of the parameters of interest. The name "marginal" distribution originates with two-way 
tables of counts where the table margins sum over one of the variables to give the distribution of 
the other variable alone (e.g., its marginal distribution). Likewise the marginal distribution of the 
parameter of interest is computed by integrating over (i.e., averaging over) the nuisance parameter. 

At this point, we specify a prior distribution which allows us to include a priori knowledge (e.g., 
"allowed parameter ranges") from other experiments or other scientific information. One of the 
primary advantages of Bayesian analysis is a well-defined mechanism for the inclusion of information 
outside the current data set. In the absence of prior information, we use diffuse or so-called non- 
informative priors, which are ordinarily flat and have minimal influence on the final analysis. The 
prior distribution itself may be conveniently parameterized using a set of hyperparameters that can 
be varied to represent the researcher's knowledge about the value of the model parameters and the 
degree of certainty of this knowledge. For example, we use the 7 distribution^ to parameterize prior 



®The 7 (q, P) distribution is a continuous distribution on the positive real line with probability density function 
p{Y) = /3"Y"~^e~''^/r(a), expected value a/f3, and variance a/f3'^ for positive a and /3. 
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information for and A'^; 

A-^|I ~ 7(a^, and A^|I ~ 7(0^, (3) 

i.e., 

p(A^|I) = (/3^)«"(A^)«"-^e-^"^"/r(a^) and p(A^|I) = (A^)"'-^e-/^'^'/r(a^), (4) 

where the notation ~ is read "follows the distribution" and A^ and A'^ are assumed a priori 
independent. The 7 prior on A^ is mathematically equivalent to a Poisson likelihood resulting 
from a count equal to — 1 obtained with an exposure of times that of the source exposure. 
(By mathematically equivalent, we mean the prior on A^ is proportional to a Poisson likelihood 
as a function of A^.) This leads to a natural choice of p{X^\I) = 7 (a^ = + 1,(3^ = /t'^), 
where arc the counts from the background exposure. Notice that here (and throughout the 
paper) we explicitly incorporate information from the background exposure into the analysis via 
the prior distribution on A^. Thus, the counts from the source exposure, Y, are treated as the 
observed data Y in Equation 1. We refer interested readers to Gelman et al. (1995, Section 2.7) 
for further discussion and examples of the 7 prior distribution with Poisson data. 

The equivalence of the 7 prior for A"^ and a"^ — 1 counts during an exposure of minutes 
leads to a natural interpretation of the hyperparameters — for a relatively non-informative prior we 
choose much less than one. To illustrate this, we consider two priors, one non-informative and 
improper^, p(A'^|I)[^l = 7(1,0) oc 1, (illustrated as a dotted line in Figure 1); and one informative, 
where, let us say we know from other means that three counts are to be expected in the same 
exposure time, hence p(A'^|I)[^l = 7(4, 1) (solid line in Figure 1). This choice of informative prior is 
only an example — 7(4, 1) corresponds to Poisson likelihood resulting from 3 counts with an exposure 
time equal to the source exposure (ten minutes). This is a rather informative prior distribution 
and is chosen to illustrate the effect of very informative prior. The non-informative prior contains 
information equivalent to zero counts in an exposure of zero seconds. 

Using Bayes's Theorem, with = {X^,X^), we can combine the 7 priors and the likelihood 
given in Equation 2 to compute the posterior distribution, 

p(A^, A^|y, I) cx e-[^"(/^"+i)+^"(^"+i)l (A^ + \^fiX''r'-\X^r'-\ (5) 

for A^ > 0, A^ > 0. 

®An improper distribution is a distribution that is not integrable and thus is not technically a distribution. One 
should use improper prior distributions only with great care since in some cases they lead to improper posterior 
distributions which are uninterpretable. 
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Nuisance parameters such as A pose a monumental difficulty for classical statistical analysis 
which often relies on fixing nuisance parameters at estimated values. Unfortunately, this does 
not account for uncertainty in their estimates and thus tends to be anti-conservative. (Likewise, 
fioating nuisance parameters or "propagating errors" when computing error bars is essentially a 
Gaussian assumption, which can lead to unpredictable results when such an assumption is not 
justified.) The Bayesian solution averages over the posterior distribution (i.e., the uncertainty) of 
the nuisance parameter by computing the marginal (posterior) distribution of the parameters of 
interest without Gaussian approximations. For example, the marginal posterior distribution of A*^ 
can be computed by (numerical) integration, 



and is illustrated for the two priors for A in the second plot of Figure 1, where we assign Y = 1 
and = 48. In this example, direct subtraction of background would leave a "negative count" of 
— 1; no such difficulty occurs with the Bayesian analysis. (See Loredo (1993) for another derivation 



Since the source count is small relative to the background count, we expect a small A . Al- 
though this is evident in both posterior distributions in Figure 1, the highly informative prior 
distribution centered at A*^ = 3 pulls the (solid) posterior towards higher values, thus illustrating 
the effect of an informative prior distribution. Such sensitivity analyses often play an important 
part in Bayesian (or other) data analyses, since they investigate the sensitivity of the results to the 
statistical assumptions (e.g., the choice of prior distribution). 

The posterior distributions should be interpreted as probability distributions representing the 
combined information in the prior and data. For example, a region, R, such that Jj^p{G\Y, T)dO = C 
is called a ^-level credible interval (or credible region if is multidimensional) and we can say 
Pr(0 e R\Y,I) = C (e.g., a 67%, 90%, or 95% credible region). The 90% credible region for the 
posterior distributions illustrated in Figure 1 are (0.77, 4.24) using the informative prior and (0.04, 
3.84) using the non-informative prior. Such probability statements arc measures of our information 
regarding the value of the parameter 6, given the data and prior information. This is in contrast 
to the more traditional frequentist definition of probability, which defines a probability to be the 
long term frequency of an event generally involving the data given 0. The posterior distribution 
is a complete summary of our information, but is often summarized by its mean, 9 = E(0|Y,I) 
and variance, Var(0|Y,I), or its modes and the curvatures at these modes. (The curvatures are 
most useful when the posterior is (locally) approximately Gaussian, as is asymptotically true under 
certain regularity conditions; e.g., Gelman et al. (1995) chapter 4.) In the following two sections we 
describe Monte Carlo methods for computing posterior means and variances and credible regions. 
To compute posterior modes (e.g., maximum likelihood estimates) van Dyk (2000) develops several 




(6) 



of the marginal distribution of A'^ in this setting.) 
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Expectation Maximization or EM algorithms for use in astrophysical applications. Posterior modes 
are often used to compute starting values for the more robust but computationally demanding 
Monte Carl methods; see Appendix A. 2. The EM algorithm gets its name because it iteratively 
maximizes the expected log posterior distribution of given the augmented data. 

Although a detailed description is beyond the scope of this paper, Bayesian methodology is 
well equipped for problems involving model selection. Methods based on Bayes's factors, computing 
the relative posterior probabilities of various competing models, and Bayesian 'p-values' are all 
important and remain areas of active statistical research (e.g., Gregory &: Loredo 1992; van Dyk & 
Protassov 2000). 



For univariate or small dimensional parameter spaces, we can usually compute the posterior 
mean, variance, credible regions, and other summaries either analytically or via non-stochastic nu- 
merical methods (e.g., Gaussian quadrature or Laplace's method). In higher dimensions, however, 
these methods can be difficult to implement partially because of the difficulty in finding the region 
where the integrand is significantly greater than zero. Thus, we often resort to Monte-Carlo in- 
tegration. In particular, if we can obtain a sample from the posterior, {0^q,t = 1,...,T}, Monte 
Carlo integration approximates the mean of any function, g, of the parameter with 



where we assume E[(7(0)|Y,I] exists. For example, g{d) = and g{6) = {0 — E(0|Y,I))(0 — 
E(0|Y,I))' lead to the posterior mean and variance respectively. Probabilities, such as C, = Pr(0 G 
R) can be computed using g{0) = I{6 G ii}, where the function / takes on value 1 if the condition in 
curly brackets holds and zero otherwise. Likewise, quantiles of the distribution can be approximated 
by the corresponding quantiles of the posterior sample. In short, a robust data analysis requires only 
a sample from the posterior distribution. A general strategy is to first sample from the posterior 
distribution and then approximate various integrals of interest via Monte Carlo. 



The Monte-Carlo approximation methods depend on our ability to obtain a sample from the 
posterior distribution. Although in some cases the posterior distribution is a well known distribution 
and trivial to sample, we must often use sophisticated algorithms to obtain a posterior sample. In 
Appendix A, we discuss three algorithms which have proven widely applicable in practice, the Data 



2.2. Evaluating the Posterior via Monte-Carlo Sampling 




(7) 



t=i 



2.3. Obtaining a Sample from the Posterior 
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Augmentation algorithm (Tanner & Wong 1987), the Gibbs sampler (Metropolis et al. 1953), and 
the Metropolis-Hastings algorithm (Hastings 1970). All of these algorithms construct a Markov 
chain with stationary distribution equal to the posterior distribution (e.g., Gelfand &: Smith 1990); 
i.e., once the chain has reached stationarity, it generates samples which are identically (but not 
independently) distributed according to the posterior distribution. These samples can then be used 
for Monte Carlo integration as described above; hence these algorithms are known as Markov chain 
Monte Carlo or MCMC methods (see Tierney 1996, for regularity conditions for using Equation 7 
with MCMC draws). Prom the onset then, it is clear that three important concerns when using 
MCMC in practice are (1) selecting starting values for the Markov chain, (2) detecting convergence 
of the Markov chain to stationarity, and (3) the effect of the lack of independence in the posterior 
draws. These issues are addressed in Appendix A. 2. 

The algorithms used to fit the models described in Section 3 rely on the method of data 
augmentation. The term 'data augmentation' originated with computational methods designed 
to handle missing data, but the method is really quite general and often useful when there is no 
missing data per se. In particular, for Monte Carlo integration we aim to obtain a sample from the 
posterior distribution, p{0\Y,I). In some cases, we can augment the model to p(0, X|Y, I), where 
X may be missing data or any other unobserved quantity (e.g., counts due to background). With 
judicial choice of X, it may be much easier to obtain a sample from p(6, X| Y, I) than directly from 
p{6\'Y, I). Once we have a sample from p{9, X| Y, I), we simply discard the sample of X to obtain 
a sample from p(6\'Y, I). In Appendix B.l, we describe how this method can be used for fitting the 
models described in Section 3. 



3. Fitting High-Resolution Low-Count Spectra 

3.1. Model Overview 

In this section, we describe a new class of (statistical) structured models which simultaneously 
describes high-resolution source spectra using Gaussian line profiles and a GLM for the continuum 
and accounts for background contamination of the image, instrument response, and absorption. 

The model may easily be generalized to account for different line profiles such as the Lorentzian 
distribution (e.g., Meng & van Dyk 1999). The (statistical) model is designed to summarize the 
distribution of photon energies arriving at a detector, which are recorded as counts in a number of 
energy channels (e.g., as many as 4096 on Chandra/ ACIS). Newly developed detectors have much 
higher resolution than their predecessors, and thus smaller expected counts per bin. Independent 
Poisson distributions are therefore more appropriate for the counts than the commonly used Gaus- 
sian approximation (e.g., fitting). We parameterize the intensity in bin j G J7 = {1, . . . , J}, as 
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the sum of a continuum term and K Gaussian lines. That is, the expected true counts per bin 
for a "perfect" instrument with effective area everywhere equal to the maximum possible effective 
area^° is 

model intensity = continuum + lines absorption, for each energy bin 

or more formally, 

(8) 

k=i 

where dEj is the known width of bin j, f{0 ,Ej) is the expected number of counts per keV 
per maximum effective area from the continuum and is a function of the continuum parameter, 
6*^ , Ej is the known mean energy in bin j, A'^ are the expected counts per maximum effective 
area from line k, pj{fj,^,v^) is the probability that a Gaussian random variable with mean fi^ and 
variance v'^ falls in bin j, and u(6^, Ej) is the probability that a photon in bin j is not absorbed. 
Specific forms for the continuum and absorption terms are discussed below in Sections 3.2 and 
3.4, respectively. The superscripts on the model parameters (0) are mnemonic and represent 
absorption (A), background (B), continuum (C), and the lines k = 1,...,K. The collection of 
parameters, e'^,e'' = {X^^iJ'.v^) for A; G /C = {1, . . . and 6^ (along with 6^ defined below) 
are represented by G. An artificial example, with power law continuum, two spectral lines, and no 
absorption appears in the first plot of Figure 2. 

Since data collection is degraded by effective area, instrument response, and background con- 
tamination (see Figure 2), we model the observed counts as independent Poisson variables with 
intensity 

observed _ instrument ( model effective \ , KapL-o-rnnnrI each 
intensity ~ response intensity area j odCKgiouna, energy channel 



or more formally, 

ii{e) = 5;M,,A,(0)d,- + Af(0^), for I G A 



(9) 



C = {1, . . . , L} where the L x J matrix M = {Mij} represents instrument response — a photon 
arriving in bin j has probability Mij of being detected in observed bin /, d = (di, . . . , dj) is the 
effective area of bin j, normalized so that maxj^jdj = 1, and \f{6^) is the expected counts due 
to the background which may be known from calibration in space or parameterized in terms of 0^ . 
As with J', £. may be any subset of detector bins. Generally the counts are also degraded by pile 
up (e.g.. Knoll 1989); see Figure 2. Here we ignore pile-up which is justifiable for low intensity or 
spatially diffuse sources (see the discussion in Section 5). 



^'^We use the maximum value of the effective area over the spectral energy range of interest in this stage of the 
analysis. This is only a matter of convenience and the full effective area variations are included in Equation 9. 
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In the next several sections, we describe the stochastic models for each of the sources of 
photons in turn. This includes both likelihoods which describe the sampling distribution of the data 
(parameterized by 9) and prior distributions which allow us to incorporate scientific information 
about the likely parameter values. As described below, the prior distributions are parameterized 
using the hyper-parameter (f). 



3.2. The Continuum 

The photon counts due to the continuum are modeled via a GLM (McCullagh & Nelder 1989), 
specifically a log linear model. That is, the log expected counts per keV per maximum effective 
area arc assumed to be a linear function of a set of independent variables, Xj" which in turn arc 
typically functions of Ej; hence the notation, f{0'^', Ej). In particular we model the counts in bin 
j due to the continuum, denoted Yj^ , as 

Yf ~ Poisson {dEj f{e^,Ej)), (10) 

i.e.ii, 

p(yf |0^) = e-'^fixfff/iYfy. with xf = f{e'',E^)dEy (ii) 

Here log f {6'^' , Ej) = X^^*^, independently for j G J, with 6*^ a {P^ x 1) vector parameter, 
a (1 X P^) vector of independent variables, and P*^ the number of parameters in the continuum 
model. Note that we are explicitly using a Poisson process for the photon counts as opposed to an 
often poor Gaussian approximation. 

The flexible framework of the GLM allows us to adjust the expected counts in bin j for any 
set of independent variables. For example, several standard continuum models are easily available. 
In particular, a power law model is obtained by setting Xj = (l,log(£'j)) for j e so that 

/(6>^, Ej) = e^?Ef = aEJ^ for j G J , (12) 

where the familiar form of the power law model in the last expression is obtained by identi- 
fying (a,/3) with (e^f,— ^2^). It is easy to generalize this to handle more complicated mod- 
els. A break in the power law (i.e., a change point) can be added at E^ by setting Xj = 
{l,\og{Ej),\og{Ej/E^)I{Ej > E^}), so that 

, r . f e''?Ef for E^ < E^ 

eoJo^eo_ec for^Gj. (13) 

I 6*^1 Ej^ ^ 3 for Ej > Ei, 



Here and in the remainder of the paper we suppress the conditioning on the initial information, I. That is, it 
should be understood that all distributions implicitly condition on I. 
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The factor E^, ^ ensures f{6 ,Ej) is continuous at Ej = E^. As a final example, we obtain an 
exponential continuum representing Bremsstrahlung emission by setting X.^ = (1, —Ej) so 

Ej) = e^'ie-^^'^ = ^e-^^/'^^ for j G J , (14) 

where (a,r) = (e^f -^fe^, 1/fc^^). 

It is convenient to assume the prior distribution on 6'-' to is multivariate Gaussian with a 
diagonal variance matrix. That is, 6p ~ l^{fip ,Vp) with (ff = {(//^, Vp), for p = 1, . . . , P^}. The 
hyper-paramctcr, cff' , is set by the user where ^i^ is a "best guess" of 6^ and t;^ is a measure (in 
squared standard deviations) of the error of this "best guess." Large values of Vp reflect little prior 
information for 9^ . 



3.3. Emission Lines 

Lines reflect deviation in the smooth spectrum due to the continuum because of photon emis- 
sions from various ions present in the source. In particular we model the energies of photons due 
to line A; G /C, denoted 1"^*^ as 

y^~N(/,u^) (15) 

i.e., 

p{Yt\ii\ v^) = _^^-(Yt-^^?/2v^ (16) 

independently for i = 1,...,N'^. Equation (15) represents a line with intensity normalized to 
one. The total line counts for a perfect instrument (i.e., with effective area everywhere equal to 
its maximum possible value) are denoted {N^, . . . ,N^) and assumed to be independent Poisson 
random variables, 

AT*^ £ Poisson(A'=) independently for k e IC. (17) 

Proper prior information for the lines and the continuum is important for a reasonable fit when 
the spectral model includes emission lines. In particular, prior information is especially important 
for relatively weak lines, since it is difficult to distinguish a weak line from a chance fluctuation 
in the continuum. Luckily such prior information is often scientifically forthcoming in the form of 
knowledge (e.g., laboratory measurements and physics theory) of probable sizes and locations of 
the various lines. We begin with the line location and width (actually the variance), {ji'^^v^) for 



-14- 



which priors are assigned independently for each line k E JC, 



^;'=^^andp(/|^;^)=NLo^4l 




(18) 



where Xuq ^ variable that follows the distribution with vq degrees of freedom. We interpret 
the hyper-parameter, = {^q,Vq, Kq,^^) using the mean and variance of the distributions in 
Equation 18. For example, 



(Recall that the units here are kcV for means and keV^ for variances.) Thus, the mean and variance 
of the prior for may be tuned using Vq and Vq; a small value of Vq results in a wide, relatively 
non-informative prior. Since the data are discrete, a priori we cannot allow the standard deviation 
of the line to become too small (say below the PHA bin width of the bin that contains the center 
of fj,'') since there is not information in the data about the width of a line which is narrower than 
one PHA channel. This is accomplished by truncating the prior distribution of v''. For the prior 
on /x*^, the mean and variance are given by /Ltg Kq; is the most probable location of the kih. 
line and Kq calibrates the uncertainty in the location of the A;th line relative to the width of the 
line. 

An alternative interpretation of the priors is in terms of additional hypothetical photons. 
Heuristically, the effect of the prior on /v/^ if were known would be the same as Kq photons all 
known to be from line k and equal to ^q . Likewise, the effect of the prior on is the same as 
adding photons with average squared deviation from the mean equal to v^. 

We now turn to the prior distribution on A*^ and set A*' ~ 7('?!'i) ^2)5''^^ (independently) which 
has mean (?!>i/02 variance 4'i/{(t>2)'^ ■ Roughly speaking, the 7 prior contains the same informa- 
tion as ^2 Poisson observations (with exposure equal to the source exposure) with a total oi 4>\ — l 
counts. Since the data consist of a single observation (for each bin), can be interpreted as the 
weight put on the prior relative to the data; 02 = 1 induces a prior as influential as the data in 
the absence of absorption, blurring, background, and lines. Thus, values of (/)2 ^ 1 are typically 
recommended for non informative priors. The hyperparameters (pi, can be interpreted as the prior 



We choose this prior distribution partially because it is the so-called conjugate prior distribution, i.e., the resulting 
posterior distribution is from the same family as the prior distribution (e.g., Gaussian with updated parameters). 
This property significantly simplifies model fitting with no cost in terms of the accuracy of parameter estimation. 

^^We choose a 7 prior partially because it is conjugate to the Poisson distribution. 



E(u^|0'=) = u^v^/iu^ - 2) for > 2 



(19) 



and 



Var(^;'=|0*^) = 2{iy^v^f{4 - 2f{iy^ - 4) for i^^ > 4. 



(20) 
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relative sizes of the lines. That is, (l>i/J2k^i prior proportion of Une photons from hne A;. 

We define the hyper-parameter for Une k as (f)'^ = {ij,q,Kq, u^, {o-qY-, 01) ^2); the last element is not 
indexed by k since it is constant for G /C. 

3.4. Absorption and Correction for Effective Area 

Prom the viewpoint of our statistical algorithm, both the telescope effective area and astro- 
physical absorption (e.g., absorption due to the ISM) are handled in the same way. These two 
processes act independently on individual photons and randomly prevent a (energy dependent) 
proportion of photons from being observed. The only essential statistical difference being that only 
the absorption process has unknown parameters. In particular, we suppose that the probability a 
photon is not absorbed (statistically speaking "censored") by either of these two processes is 

dju{e^, Ej), where log u(0^, Ej) = XfO^ for j G J, (21) 

where 6^ is a (P"^ x 1) parameter, is a (1 x P"^) vector of independent variables, and P'^ is 
the number of parameters in the absorption model, u{6^ ,Ej). As an example, simple exponential 
absorption can be written in this linear form with O'^ a scalar and = — 1/Ej, i.e., «(0^, Ej) = 
^-e^/Ej more complicated absorption models, X^ typically consists of a tabulated absorption 
function. 

The prior for 6"^ is multivariate Gaussian, 6^ ~ N(//p , Vp ), independently for p = 1, . . . , P"^. 
The prior is interpreted similarly to that for the continuum parameter 6^ . We, however truncate 
this prior to ensure exp{Xj^0"^} < 1 for each j, to ensure the proportion of photons not absorbed is 
less than 1. With appropriately chosen Xj^, this can be accomplished by assuming each component 
of is negative. 

3.5. Background 

We assume the availability of a separate observation containing background counts which can 
be used to model the background spectrum and correct the source spectrum. Rather than simply 
subtracting off (a scalar multiple of) the background counts, however, we account for the variation 
due to the Poisson character of the counts. In particular, we suppose the background count in PHA 
channel I is 

1^^ - Poisson(6lf ) independently for I € (22) 

where the unobserved quantity, Y^^ is the counts in PHA channel I that are due to background. 
We parameterize the prior for Of as 7(^/'fi, ^^2)' which we expect to be informative based on a 
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pure background exposure. In particular, a reasonable prior based on Y°^^'^ background counts 
would be j{Y°^^'^ + 1, r), where r is the background exposure time and area relative to the source 
exposure time and area. If y°^^'^ = 0, we generally replace it by some small number, such as 
0.25 in the prior. (This value should not be too small, since this would preclude the possibility of 
background counts in the corresponding channel in the same exposure.) 

In an extreme case, when the background is very well determined, e.g., via a very long exposure, 
we may fix Of = yj°'^^'^/r and discard the prior distribution; here we are effectively setting the 
prior variance to zero. Note this is not equivalent to subtracting off the background because we 
still allow for Poisson variability in the background counts that contaminate the source counts. An 
alternative strategy is to fit a parameterized model to the background. For example, we might 
assume F;^ ~ Poisson(Af (0^, Ei)), where log Af (0^, Ei) = Xf 0^ for ^ G £ with Xf a row vector 
of independent variables depending on the energy of PHA channel l,Ei. This allows the background 
counts to be modeled as a power law, broken power law, or any other log linear model. 

4. Applications 

In this section, we illustrate our methods and algorithms using two datasets. We first analyze 
ASCA/Sm data of high redshift (z=3.384) quasar S5 0014+813 (Elvis et al. 1994) to illustrate the 
various summaries available in a relatively straight-forward MCMC analysis. The second analysis 
involves an extremely low-count stellar coronal source (a Tr A) and illustrates the power of Bayesian 
methods to combine information from various sources and quantify the weak information available 
in this data. 

4.1. Quasar S5 0014+813 

A typical quasar X-ray spectrum can be described by an absorbed power law. A fluorescent 
iron line (Fe K-a) emitted at energy between ~ 6.4 — 6.8 keV if detected can be a signature of a 
reflection component and its ionization state (George et al. 2000). Quasar S5 0014+813 (Kiihr et al. 
1981) at redshift z=3.384 is among the highest X-ray flux quasars known with z ~ 3. S5 0014+813 
was observed with ASCA on 1993 October 29 with an exposure time of 22.8 ks in the SISO detector 
(Elvis et al. 1994). Here we apply our model to this data to illustrate the method and look for 
signatures of the iron emission line. 

The spectral data were extracted with the standard screening criteria (Elvis et al. 1994) and 
standard response matrices were used (ftp:/ /legacy.gsfc. nasa.gov/caldb/data/asca/sis/cpf/94nov9). 
We use all of the original 512 PHA instrument channels except the unreliable channels below ~ 0.5 
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keV and above ~ lOkeV. In addition we do not group any channels. (Channels are usually grouped 

in order to justify the use the default techniques with their Gaussian assumptions.) As is al- 
lowed with a Poisson model, we instead use only the original PHA bins. The source model included 
the exponential shape of Galactic absorption (see Section 3.5), and a power law continuum (i.e.. 
Equation 12) with a narrow emission line at 1.45 keV (observed frame; ~ 6.7keV rest frame). We 
accounted for background using a background Poisson process with intensity equal to the (rescaled) 
background counts in each PHA channel. Flat priors were used on all model parameters. 

To estimate the four model parameters (i.e., the power law, normalization, and exponential 
absorption parameters and the equivalent width-*^"* of the line) a sample from their posterior distri- 
bution was obtained by running three MCMC chains using dispersed starting values. The chains 
showed excellent mixing (as measured with \fA, see Appendix A. 2) after 2000 draws. In the Monte 
Carlo evaluation, the second half of each of the chains were used along with an additional run of 
2000 draws from each chain, for a total of 9000 draws. 

Summaries of the model fit appear in Table 2, Figure 3, and Figure 4. The parameter esti- 
mates are posterior means computed using a transformation which makes the marginal posterior 
distributions more symmetric and hence the posterior mean a more informative summary (i.e., 
In(Normalization) and sqrt (Equivalent Width)). In particular, if we represent the draws of the nor- 
malization parameter as {^[jj, t = 1, ■ ■ ■ , 9000}, the point estimate of this parameter was computed 



the geometric mean. The credible intervals are computed using the 0.025 and 0.975 quantiles of 
the draws and are invariant to (monotonic) transformations. Pairwise credible regions appear in 
Figure 3. The scatter plots illustrate the regions of highest posterior probability by plotting the 
Monte Carlo draws — Pr(0 € R) is approximately equal to the proportion of points in that region. 
The grey-scale images give Monte Carlo estimates of the (darker) 50% and (lighter) 90% marginal 
posterior regions. The grainy character of the images is due to the Monte Carlo approximation. 
Even with this relatively large data set and with the use of transformations, the non-Gaussian 
character of the posterior is evident. We expect that higher dimensional marginal posterior distri- 
butions are even less Gaussian in character. Figure 4 compares the fitted source model corrected 
for effective area and absorption with the PHA counts and illustrates the estimated continuum and 
the stability of this estimate. 



The equivalent width is defined as A'°//(0'^, /x'"). 



as 




(23) 
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4.2. Hybrid-Chromosphere Supergiant Star a TrA 

Unlike the simple power-law spectrum of the quasar in the previous section, stellar coronal 
spectra are complicated by a Bremsstrahlung continuum and the presence of numerous emission 
lines. Such complex spectra are much more difficult to model, and in addition, the intensity of 
the Bremsstrahlung continuum drops exponentially at high-energies, resulting in very few counts. 
Analyzing such spectra is however crucial to the understanding of coronal structure, mechanisms of 
coronal heating, etc. A case in point is the corona of the Hybrid supergiant star a TrA (HD 150798, 
K4II, B-V=1.44, V=l"*.92), which shows evidence of both strong magnetic activity as indicated 
by X-ray emission (Brown ct al. 1991; Kashyap et al. 1994) and stellar outflow seen in absorption 
profiles (Hartmann et al. 1981). X-ray observations with the ROSAT/PSPC (Kashyap et al. 1994) 
indicate that its corona is dominated by transient, unstable plasma that is confined by magnetic 
loops that are closed on short length scales (Rosner et al. 1994). Constraining the maximum 
temperatures present in the corona is therefore of primary importance. Here we use data obtained 
with ASCA, at higher energies than ROSAT, to model the spectrum. The low number of counts 
detected at high energies make this spectrum difficult to analyze by traditional means, and we must 
bring to bear the full power of a hierarchical Bayesian analysis in order to constrain the maximum 
temperatures present in the corona. 

a TrA was observed with ASCA in March 1995 for 34 ks. During this observation, the 
source exhibited no flares. The count rate was steady, and corresponded to the quiescent state 
identified with ROSAT. 

We model the high-energy region of the ASCA spectrum (2.5 - 7.5 keV) as a combination of 
a Bremsstrahlung continuum 

Norm p/i. T / ,N 

__^-E/ksT^ (24) 

where T is the electron temperature, E is the energy in keV, and is the Boltzmann constant, and 
Norm is a normalization; and a number (~ 10) of narrow emission lines located at the positions 
of known strong lines whose widths and locations"*^^ are fixed, but intensities are allowed to vary. 
As is our convention, the units of Norm is counts per keV per maximum effective area. Because 
of the low counts we fit a power law to the background. 

We apply the above model to SISO data (~ 28 ks) and the combined data from the 2 CIS 
detectors (~ 33 ks in each). Note that the very low counts present in these data ('^ 150 counts 
in SISO, ~ 300 in CIS) preclude any "traditional" analysis: it is only by using the full Bayesian 



^^Line location can be known only to the resolution of the instrument, and hence each of the model lines represents 
the sum of a large number of lines within the resolution element; we find that we only exclude < 5% of the line flux 
by this approximation in the energy range considered. 
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machinery that we can derive useful results from such data. 
We carry out the analysis in two steps: 

1. Choose highly non-informative priors on the parameters to analyze SISO data: for the prior on 
normalization, p ^ j ^ such that it lies in between 10~^ and 10~^ with a 90% probability; 
on temperature, p {-^) such that it is always positive and also is nearly flat in the temperature 
range of interest; and on line intensity, p [X^] such that a priori all the lines have the same 



intensity, and the maximum total counts due to lines is 100 (the total sourcc+background 
counts for the SISO observation is only 154; atomic emission line models indicate that for 
the temperature and energy range of interest, 100 corresponds to the maximum possible 
contribution to the spectrum from lines). We choose Gaussian forms for the first two and a 
Gamma prior for the last; these priors are illustrated as solid lines in Figure 5. Thus, 



p(^ln(^^^)) = Ar(/x = -9.21,c7 = 5.58) (25) 
kf) 



pl-L) = N{ii = 0.95, a = 2.5) (26) 



7((^^ = 0.11, (f)^ = 0.0033) for A; = 1, ... 10 . (27) 



2. Use the posterior resulting from the above step to define more informative priors to analyze 
GIS data. These priors also correct for the difference in exposure time and average effective 
area between the SISO and the GIS data. The posterior variances from the initial analysis 
were increased somewhat when computing the priors for the second analysis. These priors 
are illustrated as solid lines in Figure 6. Thus, 

p(^ln(^^^^^]] = A^(// = -9.97,c7 = 1.74) (28) 



VT J 

kT 



p(-i-) = Ar(^ = 0.41,(7 = 0.43) (29) 
j{(f)'i = 0.12, (/)^ = 0.025) for k = l,...,K. (30) 



We ran three Markov Chains in each analysis to obtain draws from the posterior distribution of 
{hi{N orm I \/T) , 1/kT, }},..., A^°^ . In both analyses there was excellent mixing after 6000 draws 
and we used the second half of each chain for a total of 9000 Monte Carlo draws. 

The results of the analysis are shown in Figures 5-8. (In the figures the parameter $7 refers 
to the proportion of source photons from the lines and A = J2kLi We find that the plasma 
temperature is ~ 19>ii x 10^ K (Figure 5). Such a large value (cf. ~ 2 x 10^ K in the quiet Solar 
corona) clearly lends credence to the idea that the corona on a Tr A is dominated even in quiescence 
by flare-like events. 
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As a byproduct of our analysis, we also obtain the flux in the modeled lines relative to the 
continuum. In principle, this allows us to constrain the mctallicity for the first time in the corona 
of a TrA by comparing the observed ratio of the line and continuum fluxes^^ with that derived 
from thermal emission models computed over the same temperature range (Drake & Kashyap 1998; 
Kashyap et al. 1998). The photospheric metallicity (Taylor 1999) is [Fe/H] = 0.3, and we derive for 
the coronal metallicity [Fe/H] = 0.4^Qg where the quoted range represents posterior deviations at 
1 a. While the uncertainty on our measurement is quite large (it is essentially unbounded at high 
metallicity), it is encouraging that the corona does not appear to be metal abundance deficient (see 
Drake 1996). 



5. Discussion 

The power of the Bayesian methods illustrated here lies in their ability to combine information 
and to directly model the highly structured hierarchical features of the data — both in a principled 
manner. These features are illustrated in the a TrA example. First, by combining information from 
several detectors, we are able to extract information from the data regarding the plasma temper- 
ature. More generally, Bayesian methods allow for the incorporation of various forms quantifiable 
prior information through the prior distribution. Of course, results are then conditional on the 
prior information — if these priors are not trusted, the conclusions cannot be trusted cither. On the 
other hand, if the prior information is accepted as reasonable, the posterior distribution should be 
accepted as a conglomeration of prior scientific information and the data. Second, the extremely 
low counts in the a TrA data, along with many free parameters (ten emission line intensities and 
two continuum parameters) illustrate a situation in which methods based on the Gaussian distri- 
bution and the central limit theorem are simply without justification. Methods which account for 
the Poisson (e.g., highly variable) character of the data have a sound mathematical basis and, in 
contract to standard methods such as fitting, are equipped to handle such data. 

The hierarchy in the model described in Section 3 can be extended to account for various more 
complicated features in the data, e.g., absorption lines, pile-up, and joint spatial, spectral, and 
temporal structure. Dealing with pile-up is perhaps the most important outstanding data-analytic 
challenge for Chandra. Conceptually, however, there is no difficulty in addressing pile-up in a 
Bayesian framework. After accounting for other features in the data such as instrument response, 
background, and absorption, we simply need to separate the observed counts into multiple counts of 
lower or equal energy based on the (current draw of the) spectral and spatial model. The difficulty 



^® Incompleteness in atomic line databases ((sec Brickhouse 1998)) contribute to an error of < 5% on the line-to- 
continuum ratios calculated here. They are negligible compared to the measurement error. 
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lies in computation. Simply enumerating the set of photons that could result in a particular 
observed event, let alone their relative probabilities is an enormous task. Thus, we believe there 
is great promise in Monte Carlo techniques which if carefully designed, can automatically exclude 
numerous possibilities with minute probability. Although there remains much work to be done, 
Bayesian methods in conjunction with MCMC algorithms offer a practical and innovative solution 
to many outstanding data-analytic challenges in astrophysics. 
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A. Markov Chain Monte Carlo Methods 

A.l. The Data Augmentation Algorithm 

The Data Augmentation algorithm is designed to obtain a sample from the posterior distribu- 
tion for use in Monte Carlo integration. The strategy of the algorithm is to embed the posterior 
distribution, p{0\Y), into a distribution in a large space, Y™'^|Y). If we can obtain a sample 
from this second distribution, we need only discard the sampled values ofY™'** to obtain the desired 
sample from the posterior. The quantity Y™'' can be any unobserved quantity; it is referred to 
as "missing data" for historical reasons. For clarity we denote the observed data Y"*^^ and the 
augmented data Y^"g = (Y°'^% Y^^^). In order to obtain a sample from Y™^| Y°^"), the Data 
Augmentation algorithm uses an iterative sampling scheme that samples first Y™'^ conditional on 
the model parameters and Y°^^ and second samples the model parameters given Y^"^. Clearly, the 
algorithm is most useful when both of these conditional distributions are easily sampled from. The 
iterative character of the resulting chain naturally leads to a Markov chain, which we initialize at 
some starting value, 0[o]. For t = 1, . . . ,T, where T is dynamically chosen, we repeat the following 
two steps: 

Step 1: Draw YjJ^ from p(Y^"S|Y°bs, 
Step 2: Draw 6^] from p{e\Y'^^'^). 

Under certain regularity conditions (see Meyn & Tweedie 1993; Roberts 1996; Tierney 1994, 1996, 
for details) the stationary distribution of the resulting Markov chain is the desired posterior distri- 
bution, i.e., for large t, O^^-^ approximately follows p{6\Y°^^). 

To illustrate the utility of the Data Augmentation algorithm, we return to the simple back- 
ground contamination model introduced in Section 2.1. The choice of Y^'^s ig clear in the example; 
we set Y^^s = {Y, , Y^}, where Y is the total counts, Y^ is the unobserved source counts from 
the source exposure, and Y^ is the counts from the pure background observation. (I.e., we can 
consider Y^ to be the missing data.) With this choice of Y^"s^ both p(Y^"g|Y°''% 0) and ^?(0|Y^"e) 
are easy to sample and thus the Data Augmentation algorithm is easy to use; here Y°^® = {Y, Y^} 
and6> = (A^,A^). Given some 0[o] = (-^[0]) -^^j) the two steps of the Data Augmentation algorithm 
at iteration t become 
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Step 1: Draw Yj| from 



~ binomial! y, 



I.e., 



Step 2: Draw 



^aug 



7(a^ + y-y^,/?^ + i) , 



I.e., 



vaug 



) = (Af,(/?- + i))^"'""-"^\-a(^^+^)/(A-r(a- + y-y^)), 



(Al) 



(A2) 



(A3) 



(A4) 



where and are typically chosen using the pure background observation as described in 
Section 2.1, and 

4lKr - i{a' + Y^I5' + l). (A5) 
In the first step, we stochastically divide the source count into source counts and background counts 

based on the current values of and A*^. In the second step we use this division to update A^ and 
A'^. Markov chain theory tells us the iteration converges to the desired draws from the posterior 
distribution. 

By selecting a starting value and iteratively sampling according to Equations Al, A3, and 
A5, we obtain a Markov chain which delivers a dependent sample from the posterior distribution 
upon convergence. In the next section, we use the Data Augmentation algorithm to illustrate the 
important practical issues of selecting starting values, detecting convergence, and accounting for 
the dependency in the sample. 



A. 2. Starting Values, Convergence, and Multiple Chains 

An important and difficult aspect of MCMC methods in practice is ascertaining convergence 
to stationarity. Since the stationary distribution of the Markov chain is the posterior distribution 
on interest, we can consider {0[t], t > Tq} to be a (dependent) posterior sample, which can be used 
for Monte Carlo integration. Thus, determining 0[o] ^'^d Tq is critical for valid inference. There is 
a large and growing literature on these related subjects and wc refer interested readers to recent 
texts on the subject by Gelman et al. (1995), Carlin &: Louis (1996), and Gilks ct al. (1996), as 
well as the review article on convergence by Cowles & Carlin (1996). Here we briefly outline the 
approach that we find most fruitful. 
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As proposed by Gelman & Rubin (1992), we suggest running multiple Markov chains with a 
variety of starting values spread throughoTit the parameter space. This is a useful procedure since 
a single chain can appear to have converged when actually it has only settled temporarily in one 
region of the parameter space. This is illustrated with the Markov chain in Figure 9. The three 
chains show the draws of a variance parameter for a random-effects model (for details see van Dyk &: 
Meng 2000). Note that although chain 3 appears relatively stable it is far from convergence during 
the first 10,000 draws. This is evident when it is compared with the other chains, but less so when 
we look only at the beginning of chain 3. It is recommended that the starting values for the several 
chains be spread broadly in the parameter space (relative to the region of high posterior probability) . 
This can often be accomplished by roughly mapping the posterior, for example, using estimates 
and errors based on the chi square estimates, posterior modes, or maximum likelihood estimates. 
(See van Dyk (2000) for details on the computation of posterior modes and maximum likelihood 
estimates for our spectral model.) Once such "over dispersed" starting values are obtained, we can 
run the several chains until all converge to the same region of the parameter space. (There may be 
more than one mode in the posterior, in which case the chains may converge to different modes, 
i.e., different regions of the parameter space.) Gelman Sz Rubin's (1992) statistic measures the 
relative size of the total variance in the draws of a univariate function of the parameter and the 
average within chain variance of the same function, i.e.. 



where B is the between chain variance, W is the within chain variance, and T is the number of 
draws. If the variance within each chain is as great as the total variance in all the draws, i.e., 
is near one, then we can be confident that all the chains have converged to the same region 
of the parameter space. Typically we compute \/R using the last half (or two thirds) of each of 
the chains. Once an acceptable level of is obtained (say below 1.2) we omit the first half (or 
third) of the chain in all further analysis. If we have several starting values which cover a large 
enough region of the parameter space, we can be confident that the chains sample all areas with 
high posterior probability and thus the Monte Carlo approximations are unbiased estimators of the 
quantities they estimate. 

The variance of the Monte Carlo approximations is a function of the posterior variance of 
the quantity being approximated, the posterior sample size (i.e., T — Tq), and the autocorrelation 
function of the Markov chain. Typically Monte Carlo errors are small relative to the posterior 
variance with several thousand posterior draws and thus are of little consequence. Monte Carlo 
error can be quantified by repeating the analysis for the first half and second half of the Markov 
chain and noting if the results are substantively different. See Roberts (1996) and the references 
therein for details and extensions. 




(A6) 
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A.3. The Gibbs Sampler 

In this and the next section we describe two additional MCMC methods which are designed to 
delivers a sample from the posterior distribution and are often useful when the Data Augmentation 
algorithm is not practical. The Gibbs sampler can be viewed as an extension of the Data Augmen- 
tation algorithm in which we wish to sample from p(0|Y°^**), and the vector, 0, can be viewed as 
a combination of model parameters and "missing data" . (In many instances, there is no "missing 
data".) We partition 9 into ...,6p), where Op may be a scalar or vector quantity for each p. 
The Gibbs sampler, again starts with some starting value 6^ and at iteration t samples according 
to the following conditional distributions: 

Step 1: Draw ^ Y^^s), 

Step 2: Draw (02)^ ~ p(e2|(0-2)M, Y°bs), 

Step P: Draw (0p)[,] ^ p(0p|(0_p)[i], Y°bs), 

where (0_p)[t] = • • • , {9p-i)[t], {Op+i)[t-i], ■ ■ ■ , (0p)[t_i]). That is, we draw each component 

of in turn conditional on the current values of the rest of 6 and the data. 

The advantage of the Gibbs sampler over the Data Augmentation algorithm is that in many 
settings additional conditioning results in simpler draws. The disadvantage is that the resulting 
Markov chains tend to have higher autocorrelation and are slower to converge to stationarity as P, 
the number of steps per iteration, increases. 

A.4. The Metropolis-Hastings Algorithm 

As a final extension, we consider the case when one (or more) of the steps in the Gibbs 
sample involves a conditional distribution that is not easy to sample. The Metropolis-Hastings 
algorithm (Metropolis & Ulam 1949; Metropolis et al. 1953; Hastings 1970) replaces the conditional 
distribution by some convenient ""jumping rule" which approximates the conditional distribution. 
A proposal draw is sampled according to the jumping rule and is either accepted or rejected (in 
which case, the Markov chain is fixed at the previous draw) according to a rule that maintains the 
desired stationary distribution (see, e.g., Gelman et al. 1995, for details). 
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B. Details of the MCMC Algorithm 

B.l. Data Augmentation 

The algorithms used to fit the model described in Section 3.1 rely on the method of data 
augmentation. In this section, we detail the layers of the data-augmentation scheme we use. We aim 
to construct an idealized data set for which model fitting is a relatively easy task. That is, given the 
augmented data, we can easily sample the model parameters. Likewise, given the model parameters, 
we can easily sample the augmented data and thus we can construct a Data Augmentation algorithm 
as described in Appendix A.l. Suppose, for example, a data set uncontaminated by background 
or instrument response were available. Clearly, model fitting would be easier. We define an even 
larger data set that contains the unbinned, true, and blurred energies of all photons that would have 
arrived at the detector if there had been no absorption and if we were using a perfect instrument 
with the effective area equal to its maximum value over all energies used. This data set also includes 
a variable indicating absorption and loss to reduced effective area^^ and a variable indicating the 
source of each photon, i.e., background {B), continuum (C), and each of the K line profiles; the 
set of sources is denoted S = {B, C, 1, 2, . . . , K}. This idealized data set is summarized in Table 3. 

The data augmentation scheme is illustrated in Figure 10 in which squares and circles represent 
observed and unobserved ("augmented") quantities, respectively. Given the model parameter 0, 
we obtain a sample set of photon energies, = {Yf, . . . for s G 5 (see the third column 

of Figure 10) represented the undegraded "augmented" data; is the total count for source s. 

(As a mnemonic device, more dots in the accent above Y signifies further removal of a quantity 
from actual observable quantities.) Here, contains the exact energy of all photons attributed 
to line k before absorption, with maximum effective area, and no background contamination. (The 
background photon energies, Y^, do not appear in Figure 10 because we model the detected counts 
(e.g., in PHA channels) rather than true counts, see Section 3.6.) The first two columns of Figure 10 
represent the hyperparameters and model parameters detailed in Sections 3.2-3.5. 

The array of energies represented by Y'' are binned into instrument-specific energy bins to 
obtain a sample spectrum, Y* = (Yf, . . . , Yj)' (see the fourth column in Figure 10). In particular, 

Yf = I{YI e Bj}, for j G J and s G S, (Bl) 
i=l 

where Bj is the jth energy bin. The first plot in Figure 2 illustrates the undegraded counts from 
the continuum and lines, Y^^ = J2s&s\B artificial data set, where the notation S \ B 

indicates set subtraction, i.e., the set S with B removed. The solid line represents E(Y'-'^|0) which 



Absorption and effective area are handled together, so we need only one indicator variable, see Section 3.4. 



-27- 



equals the term in square brackets in Equation 8. Due to absorption and effective area, a portion 
of these photons are not detected. The sample counts after absorption (and accounting for effective 
area) are depicted in the fifth column of Figure 10, by Y* = (1^, . . . ,Yjy with 

Yf = I{Yi e Bj}{l - Zf) for j ej andse S; (B2) 

1=1 

as described in Tabic 4, Z,f^ is one if photon i is absorbed and is zero otherwise. The second plot in 
Figure 2 represents Y'^ with E(Y^^^|0) = \j{Q)dj plotted as the solid line; see Equation 8. The next 
two circles in Figure 10 represent the adding of sources and the blurring (i.e., instrument response) 
process. In particular, Y"*" = YliseS\B^'^ ■ '^^^ blurred data, Y"*" = iY-^ , . . . ,Y^)' , is a stochastic 
function of Y+, (i.e., a multinomial distribution^^) 

Y+ - ^ multinomial(l^+ , M^- ) , (B3) 

where Y+ = {Y-i ., . . . ,Yj')' and Mj is the jth column of M; Y+ appears in the third plot of 
Figure 2 with 'Ej{Yj^\6) = ^^^j Mij\j{0)dj. The counts due to background contamination are 
denoted Y^ = (Yf,...,Y^)' and the observed data are denoted Y"^*^ = (y]°'^^ . . . , y^^^)' with 
y^obs = + Yj^ for / G C\ Y"*^® is illustrated in the final plot of Figure 2 and has expectation 
^(6*); (cf. Equation 9). 



B.2. The Algorithms 

In this section we present the details of the MCMC algorithm which we use to sample from 
the posterior distribution for our spectral model. We use an algorithm which alternately draws 
the "missing data" given the model parameters and the parameters given the "missing data". 
Both draws are conditional on the observed photon counts and the prior hyperparameters, cf) = 
{(f)^, (t)%seS}. In particular, we define two groups, (1) the augmented data, Y^^s = {Y°''^ Y-^, Y+, Y, Y, Y}, 
where Y = {Y*, s ^S\B} are the binned true energies, after absorption and accounting for effec- 
tive area, Y = {Y'^, A; G /C} are the binned true energies and Y = {Y'^, A; G /C} are the (unbinned) 
true energies and (2) = {6^, 6^, s E S} consists of the various model parameters. Using Bayes's 
Theorem, we are able to derive the necessary conditional distributions, which are described below. 

First, we draw Y^"^ from p(Y^"S|Y°^*, 0); the draw is broken into the following five steps. 



^®The multinomial (n, p) distribution is a distribution for nonnegative integer valued random vectors and generalizes 
the binomial distribution. In particular, a vector randomly selected from this distribution sums to n and its expected 
value is np, where p is a probability vector which sums to one. 
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Draw 1, Step 1: Independently separate the background counts, 



Y°'^^ e - binomial ( y^°'^^ ) , for / G (B4) 



Draw 1, Step 2: Restore the blurred the photons, 
Y 



where Y+ = 1^°^" - y^^. 

£)roi/; J, Step 3: Independently separate the counts into line and continuum counts, 

(y^,yS...,Y^) |Y+,Y^,Y°'^^6> ~ (B6) 

multinomial Y^+, ^ ^ ' , y for j G J, B7 

where = Pj{ij!',v^). 

Draw 1, Step 4: Independently restore the absorbed counts in the lines, 

y/ Y, Y^, Y°^^ 6> ~ y/ + Poisson (\''p'^{l - dju{e'^, Ej))^ , (B8) 
for j E J and k e K.. 

Draw 1, Step 5: Independently de-round the photon energies from the lines, 

Y,Y,Y^,Y°''^6> ~ iV(/,i;*), truncated to S^- (B9) 



for i = 1, . . . , y_f!: and k G IC with Y^^ = X^jej" ^j- We note that this draw is omitted for line k if 
v'' is fixed at zero, i.e., if line is a delta function. (In this case /x'^ is not fit by the algorithm). 

Second, we draw 6 from p(0|Y'^"^, 0), taking advantage of conditional independence among 
several vector components of 0. 

Draw 2, Step 1: Independently draw the background model parameters, 

ef Y-s, ^ 7(</'& + Yf , </>f2 + 1). (BIO) 



Draw 2, Step 2: Draw the variance and mean independently for each line profile 
.k 



V 



Y^"g,0~ (Bll) 
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kiyk 



i=l 



4 + 



(B12) 



V 4 + ^+ 



(B13) 



where = X^j^^ I'f • Again, this step is omitted for hne k if it is assumed to be a delta function. 

Draw 2, Step 3: Draw the hne intensities A'^, independently for each k 



^fcjYaug^ ^ d ^^yk + <^fe^ 1 + independently for A; G /C. 



(B14) 



Draw 2, Step 4: Draw the parameters for the GLM for the continuum and absorption models. For 
this final step, we condition only on Y°^^, Y^, Y+, and Y+, rather than Y^"s. We expect this 
substitution to improve the rate of convergence of the sampler. Because the conditional distribution 
is not from a standard family, we use a Metropolis Hastings step. In particular, we note that 



6>, A ~ Poisson (^dj exp(X/6>^) ^ A^j j for A; G /C 



(B15) 



Yf e,X^ Poisson {dEjdj exp(X^6>^ + X/6>^)) for j G J, (B16) 

where A = (A^, .... A^). Since given {0^, . . . , 0^) the log of each Poisson parameter differs from a 
linear combination of 0*^ and 6^ by a known constant, the conditional posterior mode can easily be 
computed using a minor modification of an Itcratively Reweighted Least Squares algorithm (e.g., 
Thistcd 1988). This algorithm can also account for the prior information described in Section 3.2 
and reports the curvature of the log posterior at the mode. A multivariate t-distribution with four 
degrees of freedom with the appropriate mode and (perhaps inflated) curvature can be used as a 
jumping distribution to generate a proposal for the next sample from the conditional distribution. 
The relative mass of the jumping distribution and actual conditional distribution of 6 and at 
the previous draw and proposed draw are combined to determine if the proposal should be accepted 
or rejected (in which case the previous draw is reused). Several (5 - 10) proposals are drawn at 
each iteration. Wc note that the same procedure can be used to fit a GLM to the background 
counts (as was done in Section 4.2). 

Although the MCMC methods detailed above may seem inhibiting as a whole, each of the 
required steps are quite simple. The power of the MCMC methods described here (e.g., the Gibbs 
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sampler) lies in their ability to break complicated model fitting tasks into a succession of relatively 
simple tasks. Our general strategy is to use Bayes's theorem to derive a posterior distribution 
which hierarchically accoTints for the complexity both in the posited model and in data collection. 
We then use modern statistical algorithms which devolve model fitting into a sequence of relatively 
simple steps. We believe this is a powerful strategy for dealing with the ever increasing power and 
sophistication of today's astronomical instruments. 



C. Internet Resources 

There are several internet sites where one can find papers describing Bayesian methods and 
related software. The MCMC preprint service, 

http : / /www . mcs . surrey . ac . uk/Personal/S . Brooks/MCMC, 

and STATLIB, 

http : //lib . Stat . emu . edu 

are both large general statistical sites that offer various software, preprints, and links that may 
be of interest to astrophysicists. Three sites (that we know of) aim specifically at the interface of 
astrophysics and statistics: 

http : //www . f as . harvard . edu/ ~vandyk/astrostat . html 
http : //astrosun . tn . Cornell . edu/staf f /loredo/bayes 
http : / / www . astro . psu . edu/statcodes 
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Fig. 1. — Combining information. The figure illustrates the combination of the information con- 
tained in the data and the prior into the posterior distribution. The less informative dotted prior 
has less influence on its (dotted) posterior, which matches the low source count more closely than 
does the solid posterior. The joint posterior indicates the region of high posterior probability for 
both parameters under the non informative prior for A'^. 
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Fig. 2. — The Degradation of Counts. The figure illustrates the various physical processes which 
significantly degrade the source model and result in the observed PHA counts. In particular, an 
artificial data set is used to illustrates (1) the absorption of (mostly low energy) counts, (2) the 
blurring of spectral features due to instrument response, (3) the shadows caused by pile-up, and (4) 
the masking of features due to background. The solid lines represent the assumed model (in the first 
three plots) and the '+' sign the simulated data. The first plot illustrates the counts per maximum 
effective area per total exposure time per bin; the remaining plots illustrate degraded counts per 
effective area per total exposure time per bin. Note that the effects of pile-up are included here 
for the sake of completeness; we do not deal with this aspect of the analysis in this paper. The 
symbols in the upper right of each plot are defined in Appendix B.l. 
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Fig. 3. — Posterior distributions of pairs of parameters obtained via MCMC. The plots show 
pairwise marginal posterior distributions for the model parameters in the analysis of Quasar S5 
0014+813. The plots in the upper right are scatter plots of the Monte Carlo draws and indicate 
areas of highest posterior probability. The plots in the lower left are gray-scale images of the 
Monte Carlo approximations to 50% (darker) and 90% (lighter) credible regions. The text along 
the diagonal labels the axes for each of the plots. 
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Fig. 4. — The Quasar S5 0014+813 model fit. This plot gives an overview of the fitted model. The 
first panel compares the fitted source model (corrected for effective area and absorption, but not the 
instrument's photon response matrix) with the observed PHA counts. The second gives the residual 
for each PHA channel, which were computed by subtracting off background and standardizing by 
the model standard deviation. The final plot illustrates the stability of the continuum. 
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ln(norm/sqrt(T)) 1/(k*Temp) sqrt(Lambda) sqrt(omega) 




-12 -8 0.0 1.0 2 4 6 8 0.0 0.4 0.8 

95% CI: (7.41e-06, 0.000215) 95% CI: { 1 .12 , 17.2 ) 95% CI: ( 3.1 1 , 79.2 ) 95% CI: { 0.0133 , 0.397 ) 

mean on original scale: 3.78e-05 mean on original scale: 2.37 mean on original scale: 23.2 mean on original scale: 0.141 

Fig. 5. — Using ASCA/SIS to compute the prior. Here, Lambda is the expected model counts from 
lines and omega is the ratio of the total counts in the lines to the total counts in the spectrum (cor- 
recting for the effects of absorption and instrument response). Transformations of the parameters 
that produce the distribution nearest to the Gaussian are displayed. The listed means and credible 
intervals, however, refer to the original parameters. The solid line in these plots represent the 
relatively diffuse priors used to compute the posterior distributions represented by the histograms 
based on the SISO observation. These posterior distributions were in turn used to choose the priors 
for the GIS data after some dispersion was added, as represented by the dotted curves. We do not 
specify the prior for the proportion of source photons from the lines, Q, but rather this prior is 
implied by the other priors. The solid line is an approximation based on sampling from the prior 
and distributing the SISO counts to the continuum and lines after correcting for background. 



ln(norm/sqrt(T)) 1/(k*Temp) sqrt(Lambda) sqrt(omega) 



d 



d 



o 
d 




-12 -8 

95% 01: ( 9.24e-06 , 0.000357 ) 
mean on original scale: 7.1e-05 




95% CI: (0.959, 5.55) 
mean on original scale: 1.83 



d 



o 
d 




5 10 15 

95% CI: (6.2, 93.3) 
mean on original scale: 37 




0.0 0.4 0.8 



95% 01: ( 0.0368 , 0.547 ) 
mean on original scale: 0.226 



Fig. 6. — Some marginal posterior distributions. Using the priors computed with SISO data (solid 
lines) we fit the source model to the GIS data. The resulting marginal posterior distributions are 
illustrated here using normalizing transformations. The estimates and credible intervals are on the 
original scales. 
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Fig. 7. — Some bivariate marginal posterior distributions. These plots are as described in Figure 3 
and illustrate pairwise credible regions for the various model parameters. Again the text along the 
diagonal labels the axes for each of the plots 
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Fig. 8. — The model fit. These plots are as described in Figure 4 - note the the instability of the 
continuum due to the low counts. 
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Fig. 9. — Several chains from a random-effects model. Notice that chain 3 appears to have converged 
during the first 10000 iterations. Comparison with chain 1 and chain 2, however, makes it clear 
that chain 3 did not converge until after iteration 10000. 
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Caption for Figure 10: A graphical representation of the data-augmentation scheme. Here 
represents hyperparameters, 6 model parameters, Y true photon energies, Y binned energies, 
Y binned true photon energies after absorption accounting for effective area, Y source counts 
in PHA channels, Y°^^ the observed counts, M the instrument response matrix, d the effective 
area vector, X"^ and X*^ independent variables describing absorption and continuum respectively, 
circles represent unobserved quantities, and squares observed quantities; details of the subscripts 
and superscripts are given in the text. The figure illustrates the interplay of the various model 
parameters, hyperparameters, observed quantities, and data augmentation. As an example, the 
first arrow in the row labeled 'background' corresponds to the relationship between the background 
hyperparameters, 0^, and the background intensities, 6^, e.g.. Of ~ 7(0fu<?^f2)- The second 
arrow corresponds to the Poisson nature of the background counts; see Equation 22. The final 
background arrow illustrates the background contamination of the observed PHA counts. 
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Terms 

absorption 

Bayes's Theorem 

conjugate prior distribution 

credible interval 

data augmentation 

data augmentation algorithm 

effective area 

equivalent width 

gamma distribution 

GLM or generalized linear model 

Gibbs sampler 

hierarchical model 

hypcrparameters 

improper distribution 

MCMC or Markov chain Monte Carlo 

marginal distribution 

maximum effective area 

model 

Monte Carlo integration 

multinomial distribution 

non-informative prior distribution 

nuisance parameter 

observed-data model 

Poisson distribution 

posterior distribution 

prior distribution 

PHA or pulse height amplitude 

source model 



Defined and/or Discussed 
Sections 3.1, 3.4 
Equation 1 
Footnote 10 
Section 2.1 
Sections 1, 2.3, B.l 
Sections 2.3, A.l 
Footnote 3 
Footnote 12 
Footnote 7 

Footnote 4, Section 3.2 
Section 2.3, A. 3 
Footnote 5 
Section 2.1 
Footnote 8 

Sections 2.3, A.l, A.2, A.3, A.4 

Section 2.1, Equation 6 

Footnote 9 

Section 1 

Section 2.2 

Footnote 16 

Section 2.1 

Section 2.1, Equation 6 
Section 1 
Footnote 2 
Equation 1 
Equation 1 
Footnote 1 
Section 1 



Table 1: Index of various terms discussed and defined in the paper. 
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Parameter Estimate 95% Interval 

Power Law 2.23 (1.90, 2.58) 

Normalization 2.47e-3 (1.37e-3, 4.55e-3) 

Absorption 2.05 (1.43, 2.71) 

Equivalent Width 0.0282 keV (0.0023, 0.0788) keV 



Table 2: Fitted values and credible regions for the Quasar data. 



Variable 



Notation Range 



The photon energy Yi Positive, measured in keV 

Indicator for Background 1 for background photons 

for other photons 
Indicator for Continuum Zf 1 for continuum photons 

for other photons 
Indicator for Line k, Zf 1 for photons from line k 

for k = \, . . . K for other photons 

Indicator for Absorption Zf^ 1 for absorbed photons 

for other photons 



Table 3: Variables associated with each photon, iov i = 1, . . . ,N . 
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Variable 



Notation Range 



The unbinned energies 
The binned energies 
The binned energies 
after absorption 

The bhirred PHA counts 
without background 
The observed data 



vobs 



Positive, measured in keV, s E S 
Counts for j e J',s e S 
Counts for j G JT", s G iS 

Counts for Z G £ 

Counts ioT I e jO 



Table 4: Summary statistics for the spectral model. 



